Skip to content

fix(shared-log): guard persistCoordinate against TOCTOU race during shutdown#3

Open
Faolain wants to merge 1 commit intofix/pubsub-subscribe-racefrom
fix/rootcause-b-persistcoordinate-guard
Open

fix(shared-log): guard persistCoordinate against TOCTOU race during shutdown#3
Faolain wants to merge 1 commit intofix/pubsub-subscribe-racefrom
fix/rootcause-b-persistcoordinate-guard

Conversation

@Faolain
Copy link
Owner

@Faolain Faolain commented Feb 6, 2026

Summary

This PR adds targeted guards to persistCoordinate in SharedLog to prevent unhandled rejections caused by a Time-of-Check-to-Time-of-Use (TOCTOU) race condition during node shutdown.

The issue: when _close() is called, it nullifies internal indices (_entryCoordinatesIndex, _replicationRangeIndex), but async operations already in the microtask/timer queue (triggered by replication events) still execute persistCoordinate. The existing !this.closed check in findLeaders passes before _close() completes, but by the time persistCoordinate runs, the indices are gone — causing TypeError: Cannot read properties of undefined.

Root Cause Analysis

This was identified as Root Cause B in a systematic investigation of SharedLog teardown races. Three hypotheses were tested in parallel using isolated git worktrees:

Approach Description Errors (baseline: 6) Result
A — Debounce guard Patch rebalanceParticipation + __freqClosed flag 7 errors WORSE — @peerbit/time debouncer holds a direct closure reference, bypassing prototype patches
B — persistCoordinate TOCTOU guard (this PR) Multi-layer guard in persistCoordinate 2 errors BEST — 4 errors eliminated with minimal, targeted fix
C — Comprehensive shutdown guards All of B + guards on handleSubscriptionChange, removeReplicator, rebalanceParticipation + unhandledRejection safety net 2 errors Same reduction, but more invasive with no incremental benefit

Changes

Four-layer guard added to persistCoordinate (~line 3662 in index.ts):

  1. this.closed early bail — catches the common case where close has already completed
  2. Entry hash validation — ensures the replication entry is valid before proceeding
  3. Index existence checks — verifies entryCoordinatesIndex and _replicationRangeIndex haven't been torn down by _close()
  4. Try/catch with post-execution closed check — handles the narrow TOCTOU window where _close() completes mid-execution; swallows the error if we're now closed (expected during shutdown), re-throws otherwise
async persistCoordinate(...) {
  if (this.closed) return;                          // Guard 1
  if (!entry?.hash) return;                         // Guard 2  
  if (!this.entryCoordinatesIndex) return;          // Guard 3
  if (!this._replicationRangeIndex) return;         // Guard 3
  try {
    return await originalPersistCoordinate(...);     // Original logic
  } catch (err) {
    if (this.closed) return;                        // Guard 4
    throw err;
  }
}

Test Results

Methodology: Full vitest test suite (178 tests) run on each worktree. Integration tests (library, playlist, identity, replication.boundaries) also run 3x separately to check for flakiness.

Full Suite Results

Metric Baseline (PR dao-xyz#589 only) This PR
Passed 171 171
Failed 7 (pre-existing) 7 (pre-existing)
Unhandled errors 6 2

Remaining 2 Errors

The remaining 2 unhandled errors originate from @peerbit/program's RPC layer (RPC.close → TypedEventEmitter.dispatchEvent), not from SharedLog. These require a separate fix in the RPC/program teardown path.

Pre-existing Failures (unchanged, present on all branches)

  • chunkedAesGcmV1.test.ts (2 failures)
  • ProfilePage.test.tsx (2 failures)
  • WalletPanel.purchaseForm.test.tsx (flaky, 0-12 failures)
  • RecoverySetupPanel.test.tsx (1 failure)
  • generateSampleMp3.test.ts (1 failure)

Relationship to PR dao-xyz#589

This fix is complementary to the pubsub subscribe race fix in PR dao-xyz#589 (dao-xyz#589). PR dao-xyz#589 fixes the root cause of dropped subscription messages during topic initialization. This PR addresses a separate race condition exposed during teardown — persistCoordinate executing after _close() has torn down internal indices. Both fixes are needed for a clean test suite.

Why This Approach Over Comprehensive Guards (Root Cause C)

Root Cause C added guards to handleSubscriptionChange, removeReplicator, and rebalanceParticipation as well. Testing showed zero incremental error reduction — all 4 eliminated errors came from the persistCoordinate guard alone. The minimal approach (this PR) is preferred because:

  1. Smaller diff, easier to review
  2. No behavioral changes to subscription handling or rebalancing logic
  3. Same error reduction with less risk of unintended side effects

…hutdown

During test teardown (and potentially production close), _close() can
complete between the caller's !this.closed check in findLeaders and
persistCoordinate's actual execution. At that point entryCoordinatesIndex
is undefined, causing an unhandled promise rejection.

This adds:
- this.closed bail-out at method entry
- entryCoordinatesIndex/replicationRangeIndex existence checks
- try/catch wrapper that swallows errors if closed during execution

Validated: reduces unhandled errors from 6 to 2 in integration tests.
The remaining 2 errors originate from @peerbit/program RPC layer.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant